Topic: How can B&W Tek LLC improve its proprietary chemometric modeling software, BWIQ, so that SIMCA models share more information with end customers both analyticall and graphically?
Soft independent modelling of class analogies (SIMCA) is a supervised classification technique that is commonly used in fields such as chemometrics and spectroscopy. One of the noteworthy aspects of SIMCA is that samples can be identified as belonging to multiple classes, which is what the word “soft” references. For each class of samples, run PCA on the observations for dimensionality reduction purposes. Each modeled class will be described by a flat affine n-dimensional subspace, with n being the number of principal components in the model. For a single principal component, model the class using a line; for two principal components, model the class using a plane. For three or more principal components, the class model is going to be a hyperplane. Next, calculate the mean orthogonal distance (OD) from the training set to the flat affine n-dimensional subspace and use that number to determine the threshold for classification, called the critical distance, which is found with the F-distribution. Now that each class has a dimensionally reduced model and a classification threshold, new observations may be tested against the classifier model.
The current iteration of BWIQ only gives the end user an accuracy value and a barebones plot after building a SIMCA model. In this exploration, I will employ and explain several analytical and graphical techniques that convey valubale information about the SIMCA model using spectra collected from a sample of calcium nitrate and a sample of magnesium nitrate with a NanoRam 785 nm device. One of the first things we can do after building the SIMCA model is use the summary() function in R to receive some enlightening statistics about the class model for calcium nitrate:
##
## SIMCA model for class "calcium nitrate" summary
##
## Info:
## Method for critical limits: jm
## Significance level (alpha): 0.05
## Selected number of components: 3
##
## Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
## Comp 1 70.66 70.66 0.94 55.89 0.88
## Comp 2 4.90 75.56 1.00 2.22 0.81
## Comp 3 3.94 79.49 1.00 2.29 0.62
Next, we can call the very simple plot() function on the calcium nitrate model object and get several informative statistics:
In terms of SIMCA, the scores (also called factor scores or component scores) are the transformed variable values corresponding to particular data points. Another way to describe score is the distance between the point an observation is projected onto the principal component axis and the origin.
Modelling power shows the amount each variable contributes to the variation described by the principal components. The default representation in plot() is cluttered by each variable name being displayed over the vertical bars because there are so many variables in spectral data.
Squared residual distance (Q) and Hotelling’s T-squared are summary statistics typically used in factor-based models, which explain how well a model describes a given observation and assist with outlier detection. Q residual uses the sum of squares of each observation in the error matrix, which means the Q residual measures the difference between an observation and its projection into the space defined by the principal component eigenvectors. For the ith observation, calculate the Q residual using the following formula, where \(e_i\) is the ith row of the error matrix E:
\[Q_i = e_ie_i^T\]
Hotelling’s T-squared statistic uses the sum of normalized squared scores, which means it measures the variation within each observation in the PCA class model. Therefore, T-squared gives the distance between the projection of the observation onto the principal component eigenvectors and the multivariate mean. The Hotelling distribution is as follows, where S is the sample covariance matrix:
\[T^2 = (\bar{x} - \mu_0)^T S^{-1} (\bar{x} - \mu_0)\]
In summary, Q residuals demonstrate the magnitude of variation left in each observation after the observation is projected through the model, while Hotelling’s T-squared represent the distance each observation is from the center of the class model (scores = 0).
The cumulative variance plot is fairly straightforward; it shows the total percentage of the variation in the data that is explained using increasing numbers of principal components.
Finally, we can check out an interactive visualization of the scores for the calcium nitrate model class mapped in ℝ3 using the first three eigenvector axes:
Now we must do the same steps for the magnesium nitrate class model:
##
## SIMCA model for class "magnesium nitrate" summary
##
## Info:
## Method for critical limits: jm
## Significance level (alpha): 0.05
## Selected number of components: 3
##
## Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
## Comp 1 85.76 85.76 0.94 68.37 0.94
## Comp 2 4.07 89.83 0.94 8.43 0.94
## Comp 3 1.91 91.74 1.00 1.55 0.88
We can take the single-class SIMCA models for magnesium nitrate and calcium nitrate and combine them into a simcam object using the aptly named simcam() function in the mdatools package, which allows for SIMCA multiclass classification.
nitrates <- simcam(models = list(ca_model, mg_model), info = 'Nitrate SIMCA')
Again, we will run the summary() function to get a cursory look at our model object:
summary(nitrates)
##
## SIMCA multiple classes classification (class simcam)
## Nmber of classes: 2
## Info: Nitrate SIMCA
##
## SIMCA model for class "calcium nitrate" summary
##
## Info:
## Method for critical limits: jm
## Significance level (alpha): 0.05
## Selected number of components: 3
##
## Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
## Comp 1 70.66 70.66 0.94 55.89 0.88
## Comp 2 4.90 75.56 1.00 2.22 0.81
## Comp 3 3.94 79.49 1.00 2.29 0.62
##
## SIMCA model for class "magnesium nitrate" summary
##
## Info:
## Method for critical limits: jm
## Significance level (alpha): 0.05
## Selected number of components: 3
##
## Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
## Comp 1 85.76 85.76 0.94 68.37 0.94
## Comp 2 4.07 89.83 0.94 8.43 0.94
## Comp 3 1.91 91.74 1.00 1.55 0.88
We can see that taking a summary of the simcam model object basically just pastes the summaries from the individual SIMCA model classes together. There isn’t much information gained, but it does offer utility in the organizational sector. A Cooman’s plot is a common way to graphically display the results for SIMCA. Cooman’s plots do pairwise comparison of the SIMCA classes by examining the distance from each observation in one class to the model of the other class. The plot is broken up into four regions with each region having a distinct meaning. Observations which are judged to belong to either of the two classes but not both classes fall in either the top left or the bottom right sectors, depending on which class they are determined to be in. Observations which are determined to fall into both classes will be in the bottom left sector, and observations which don’t belong to either class will appear in the top right sector.
Cooman’s plots are fairly useful for a small number of SIMCA classes, but it’s important to keep in mind that the number of plots required to examine all class pairings grows in quadratic time. Examining n SIMCA classes in such a manner will require \(\frac{n(n-1)}{2}\) Cooman’s plots. For situations where there are many classes to examine, it will be prohibitively difficult to examine the myriad plots in a reasonable amount of time.
The discrimination power plot tells us which variables in the observations contributed the most to separating the classes. Since the observations are spectral data, the discrimination power should be the greatest for the variables which are pixels of the main peaks of the Raman spectra. Similarly, discrimination power plots are created pair-wise between two classes at a time, so the number of plots will again grow quadratically as the number of classes increases.
We can use the versatile predict() function to run our calibration set against the SIMCA model and see a visual representation of the class predictions:
We can also do the same thing with the test set to get predictions against data not used to build the SIMCA class models:
A confusion matrix is another useful tool for letting a user know about the performance of their classification model, giving the number of false positives, false negatives, true positives, and true negatives. From these results, one can calculate several other measures of model performance such as sensitivity, specificity, F1 score, and accuracy. Here is an example of a confusion matrix for the calibration set:
## calcium nitrate magnesium nitrate None
## calcium nitrate 16 0 0
## magnesium nitrate 0 16 0
A user could see that all of the samples were correctly identified as either a true positive or a true negative and that there were no misclassifications in this case.
The methods which I have outlined in this overview should be sufficient to satisfy the needs of both customers and internal employees alike. I highly recommend adding the plots and analytics discussed in this reprt to the BWIQ software.
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mdatools_0.9.4 factoextra_1.0.6 ggfortify_0.4.8
## [4] data.table_1.12.2 plyr_1.8.4 plotly_4.9.1
## [7] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
## [10] purrr_0.3.2 readr_1.3.1 tidyr_1.0.0
## [13] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.9 haven_2.1.1
## [4] lattice_0.20-38 colorspace_1.4-1 vctrs_0.2.0
## [7] generics_0.0.2 htmltools_0.3.6 viridisLite_0.3.0
## [10] yaml_2.2.0 rlang_0.4.0 later_0.8.0
## [13] pillar_1.4.2 glue_1.3.1 withr_2.1.2
## [16] modelr_0.1.5 readxl_1.3.1 lifecycle_0.1.0
## [19] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0
## [22] rvest_0.3.4 htmlwidgets_1.3 evaluate_0.14
## [25] knitr_1.25 httpuv_1.5.2 crosstalk_1.0.0
## [28] broom_0.5.2 Rcpp_1.0.2 xtable_1.8-4
## [31] promises_1.0.1 scales_1.0.0 backports_1.1.4
## [34] jsonlite_1.6 mime_0.7 gridExtra_2.3
## [37] hms_0.5.1 digest_0.6.21 stringi_1.4.3
## [40] shiny_1.3.2 ggrepel_0.8.1 grid_3.6.1
## [43] cli_1.1.0 tools_3.6.1 magrittr_1.5
## [46] lazyeval_0.2.2 crayon_1.3.4 pkgconfig_2.0.3
## [49] zeallot_0.1.0 xml2_1.2.2 lubridate_1.7.4
## [52] assertthat_0.2.1 rmarkdown_1.15 httr_1.4.1
## [55] rstudioapi_0.10 R6_2.4.0 nlme_3.1-140
## [58] compiler_3.6.1